{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Representing Categorical Variables](04.03-Representing-Categorical-Variables.ipynb) | [Contents](../README.md) | [Representing images](04.05-Representing-Images.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Representing Text Features\n",
    "\n",
    "Similar to categorical features, scikit-learn offers an easy way to encode another common\n",
    "feature type, **text features**. When working with text features, it is often convenient to\n",
    "encode individual words or phrases as numerical values.\n",
    "\n",
    "Let's consider a dataset that contains a small corpus of text phrases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sample = [\n",
    "    'feature engineering',\n",
    "    'feature selection',\n",
    "    'feature extraction'\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the simplest methods of encoding such data is by **word count**; for each phrase, we count the occurrences of each word within it. In scikit-learn, this is easily done using\n",
    "`CountVectorizer`, which functions akin to `DictVectorizer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3x4 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 6 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "vec = CountVectorizer()\n",
    "X = vec.fit_transform(sample)\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, this will store our feature matrix `X` as a sparse matrix. If we want to manually\n",
    "inspect it, we need to convert it to a regular array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 0, 1, 0],\n",
       "       [0, 0, 1, 1],\n",
       "       [0, 1, 1, 0]], dtype=int64)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "...with corresponding feature names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['engineering', 'extraction', 'feature', 'selection']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vec.get_feature_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible shortcoming of this approach is that we might put too much weight on words\n",
    "that appear very frequently. One approach to fix this is known as term **frequency-inverse\n",
    "document frequency (TF-IDF)**. What TF-IDF does might be easier to understand than its\n",
    "name, which is basically to weigh the word counts by a measure of how often they appear\n",
    "in the entire dataset.\n",
    "\n",
    "The syntax for TF-IDF is pretty much similar to the previous command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.861037  ,  0.        ,  0.50854232,  0.        ],\n",
       "       [ 0.        ,  0.        ,  0.50854232,  0.861037  ],\n",
       "       [ 0.        ,  0.861037  ,  0.50854232,  0.        ]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vec = TfidfVectorizer()\n",
    "X = vec.fit_transform(sample)\n",
    "X.toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We note that the numbers are now smaller than before, with the third column taking the\n",
    "biggest hit. This makes sense, as the third column corresponds to the most frequent word\n",
    "across all three phrases, `'feature'`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['engineering', 'extraction', 'feature', 'selection']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vec.get_feature_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Representing text features will become important in [Chapter 7](07.00-Implementing-a-Spam-Filter-with-Bayesian-Learning.ipynb), *Implementing a Spam Filter\n",
    "with Bayesian Learning*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Representing Categorical Variables](04.03-Representing-Categorical-Variables.ipynb) | [Contents](../README.md) | [Representing images](04.05-Representing-Images.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
